NullSeq: A Tool for Generating Random Coding Sequences with Desired Amino Acid and GC Contents

Authors

  • Sophia S. Liu
  • Adam J. Hockenberry
  • Andrea Lancichinetti
  • Michael C. Jewett
  • Luis A. Nunes Amaral

Abstract

The existence of over- and under-represented sequence motifs in genomes provides evidence of selective evolutionary pressures on biological mechanisms such as transcription, translation, ligand-substrate binding, and host immunity. To accurately identify motifs and other genome-scale patterns of interest, it is essential to generate null models that are appropriate for the sequences under study. While many tools have been developed to create random nucleotide sequences, protein coding sequences are subject to a unique set of constraints that complicates the process of generating appropriate null models. There are currently no tools available that allow users to create random coding sequences with specified amino acid composition and GC content for the purpose of hypothesis testing. Using the principle of maximum entropy, we developed a method that generates unbiased random sequences with pre-specified amino acid and GC content, and we implemented it as a Python package. Our method is the simplest way to obtain maximally unbiased random sequences that are subject to GC usage and primary amino acid sequence constraints. Furthermore, this approach can easily be expanded to create unbiased random sequences that incorporate more complicated constraints such as individual nucleotide usage or even di-nucleotide frequencies. The ability to generate correctly specified null models will allow researchers to accurately identify sequence motifs, which will lead to a better understanding of biological processes as well as more effective engineering of biological systems.
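The maximum-entropy idea in the abstract can be illustrated with a minimal Python sketch. It is not the NullSeq package or its API. It assumes a fixed amino acid sequence, the standard genetic code, and a constraint on the expected GC content only (individual samples will scatter around the target). Under those assumptions, the maximum-entropy codon distribution for each residue weights every synonymous codon by exp(lambda * number of G/C bases), and lambda is found by bisection. All names here (`codon_weights`, `solve_lambda`, `random_coding_sequence`) are hypothetical and chosen for illustration.

```python
import math
import random

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TO_AA = dict(zip(CODONS, AMINO))
CODONS_FOR = {}
for codon, aa in CODON_TO_AA.items():
    CODONS_FOR.setdefault(aa, []).append(codon)


def gc_count(codon):
    """Number of G or C bases in a codon (0 to 3)."""
    return sum(base in "GC" for base in codon)


def codon_weights(aa, lam):
    """Maximum-entropy probabilities over the synonymous codons of `aa`.

    Assumption of this sketch: the only constraint is expected GC content,
    so each codon gets weight exp(lam * gc_count(codon)).
    """
    raw = [math.exp(lam * gc_count(c)) for c in CODONS_FOR[aa]]
    total = sum(raw)
    return [w / total for w in raw]


def expected_gc(protein, lam):
    """Expected GC fraction of a coding sequence for `protein` at a given lam."""
    total = 0.0
    for aa in protein:
        weights = codon_weights(aa, lam)
        total += sum(w * gc_count(c) for w, c in zip(weights, CODONS_FOR[aa]))
    return total / (3 * len(protein))


def solve_lambda(protein, target_gc, lo=-20.0, hi=20.0, iterations=60):
    """Bisection on lam; expected_gc is non-decreasing in lam."""
    if not expected_gc(protein, lo) <= target_gc <= expected_gc(protein, hi):
        raise ValueError("target GC content is not reachable for this protein")
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if expected_gc(protein, mid) < target_gc:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def random_coding_sequence(protein, target_gc, rng=random):
    """Sample one coding sequence whose expected GC fraction is target_gc."""
    lam = solve_lambda(protein, target_gc)
    return "".join(
        rng.choices(CODONS_FOR[aa], weights=codon_weights(aa, lam))[0]
        for aa in protein
    )
```

For example, `random_coding_sequence("MKTAYIAKQR", 0.5)` returns a 30-nucleotide sequence that translates back to the same peptide. Setting lambda to zero recovers uniform synonymous codon usage, which is the unconstrained maximum-entropy case.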

Similar resources

GENERATING FUZZY RULES FOR PROTEIN CLASSIFICATION

This paper considers the generation of some interpretable fuzzy rules for assigning an amino acid sequence into the appropriate protein superfamily. Since the main objective of this classifier is the interpretability of rules, we have used the distribution of amino acids in the sequences of proteins as features. These features are the occurrence probabilities of six exchange groups in the seque...

A general trend for invertebrate mitochondrial genome evolution

Exploring DNA and protein evolution is basic research. Here, 60 invertebrate mitochondrial genomes were selected, and the evolution of their whole-genome, protein-coding DNA, and protein sequences was studied. Across whole mitochondrial genomes, AT content increases and GC content decreases over the course of evolution. The net ratio of AT conte...

Isolation and molecular characterization of partial FSH and LH receptor genes in Arabian camels (Camelus dromedarius)

Very little is known about LHR and FSHR genes of domestic dromedary camels. The main objective of this study was to determine and analyze partial genomic regions of FSHR and LHR genes in dromedary camels for the first time. To this end, a total of 50 DNA samples belonging to dromedary camels raised in Iran were sent for sequencing (25 samples of each gene). We compared the nucleotide sequences ...

Nucleotide sequence of a rice acidic ribosomal phosphoprotein P0 cDNA.

The eukaryotic 38-kD acidic ribosomal protein P0 is localized to the stalk of the 60S ribosomal subunit, forming a pentameric complex with dimers of 13-kD acidic ribosomal proteins P1 and P2 (Rich and Steitz, 1987). P1 (P2) and P0 are analogous to Escherichia coli ribosomal protein L7/L12, and L10, respectively, and are thought to have the same functions as their prokaryotic counterparts, i.e. ...

A Content-Centric Organization of the Genetic Code

The codon table for the canonical genetic code can be rearranged in such a way that the code is divided into four quarters and two halves according to the variability of their GC and purine contents, respectively. For prokaryotic genomes, when the genomic GC content increases, their amino acid contents tend to be restricted to the GC-rich quarter and the purine-content insensitive half, where a...

Journal title:

Volume 12, Issue

Pages -

Publication date: 2016